Structure-Preserving Multi-View Embedding Using Gromov-Wasserstein Optimal Transport

Eufrazio, Rafael Pereira, Montesuma, Eduardo Fernandes, Cavalcante, Charles Casimiro

arXiv.org Machine Learning

Multi-view data analysis seeks to integrate multiple representations of the same samples in order to recover a coherent low-dimensional structure. Classical approaches often rely on feature concatenation or explicit alignment assumptions, which become restrictive under heterogeneous geometries or nonlinear distortions. In this work, we propose two geometry-aware multi-view embedding strategies grounded in Gromov-Wasserstein (GW) optimal transport. The first, termed Mean-GWMDS, aggregates view-specific relational information by averaging distance matrices and applying GW-based multidimensional scaling to obtain a representative embedding. The second strategy, referred to as Multi-GWMDS, adopts a selection-based paradigm in which multiple geometry-consistent candidate embeddings are generated via GW-based alignment and a representative embedding is selected. Experiments on synthetic manifolds and real-world datasets show that the proposed methods effectively preserve intrinsic relational structure across views. These results highlight GW-based approaches as a flexible and principled framework for multi-view representation learning.
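The averaging idea behind Mean-GWMDS can be illustrated with a minimal sketch. It is not the authors' method: classical (Torgerson) MDS is swapped in for the GW-based scaling step, and the function names (`pairwise_distances`, `classical_mds`, `mean_distance_embedding`) and the toy two-view data are assumptions made for illustration.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS, used here as a stand-in for GW-based MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                    # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:dim]          # indices of the largest `dim`
    scale = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * scale

def mean_distance_embedding(views, dim=2):
    """Average the per-view distance matrices, then embed the average."""
    D_mean = np.mean([pairwise_distances(V) for V in views], axis=0)
    return classical_mds(D_mean, dim=dim), D_mean

# Toy data (assumption): two views of the same 30 samples, the second a rotation.
rng = np.random.default_rng(0)
base = rng.normal(size=(30, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
views = [base, base @ R.T]

embedding, D_mean = mean_distance_embedding(views, dim=2)
```

Because a rotation preserves pairwise distances, both views share the same distance matrix in this toy case, so the embedding reproduces the averaged distances up to numerical error. Views with genuinely different geometries are the setting the abstract targets, and there a plain average with classical MDS may behave quite differently from the GW-based variant.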


Leadership Detection via Time-Lagged Correlation-Based Network Inference

da Silva, Thayanne França, Maia, José Everardo Bessa

arXiv.org Artificial Intelligence

Understanding leadership dynamics in collective behavior is a key challenge in animal ecology, swarm robotics, and intelligent transportation. Traditional information-theoretic approaches, including Transfer Entropy (TE) and Time-Lagged Mutual Information (TLMI), have been widely used to infer leader-follower relationships but face critical limitations in noisy or short-duration datasets because they rely on robust probability estimation. This study proposes a method based on dynamic network inference using time-lagged correlations across multiple kinematic variables: velocity, acceleration, and direction. Our approach constructs directed influence graphs over time, enabling the identification of leadership patterns without the need for large volumes of data or parameter-sensitive discretization. We validate our method through two multi-agent simulations in NetLogo: a modified Vicsek model with informed leaders and a predator-prey model featuring coordinated and independent wolf groups. Experimental results demonstrate that the network-based method outperforms TE and TLMI in scenarios with limited spatiotemporal observations, ranking true leaders at the top of influence metrics more consistently.
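A minimal sketch of time-lagged correlation for leader-follower inference follows. It assumes a single kinematic variable, a fixed maximum lag, and synthetic data. The function names, the out-strength ranking, and the toy signals are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag], for lag >= 1."""
    return np.corrcoef(x[:-lag], y[lag:])[0, 1]

def influence_matrix(series, max_lag=5):
    """W[i, j]: strongest correlation between agent i's past and agent j's future."""
    n = series.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i, j] = max(
                    lagged_correlation(series[i], series[j], lag)
                    for lag in range(1, max_lag + 1)
                )
    return W

# Synthetic data (assumption): agent 0 leads; agents 1 and 2 copy its
# signal with delays of 2 and 3 time steps, plus observation noise.
rng = np.random.default_rng(0)
T = 300
signal = rng.normal(size=T + 3)
leader = signal[3:]
follower_a = signal[1:T + 1] + 0.3 * rng.normal(size=T)   # leader delayed by 2
follower_b = signal[0:T] + 0.3 * rng.normal(size=T)       # leader delayed by 3
series = np.vstack([leader, follower_a, follower_b])

W = influence_matrix(series, max_lag=5)
out_strength = W.sum(axis=1)            # total outgoing influence per agent
ranking = np.argsort(out_strength)[::-1]
```

Here `W` plays the role of a weighted directed influence graph, and ranking agents by outgoing weight is one simple way to surface a candidate leader. The abstract describes graphs built over time from several variables (velocity, acceleration, direction), which this single-variable, single-window sketch does not attempt.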


Evaluating LLMs and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports

Caminha, Carlos, Silva, Maria de Lourdes M., Chaves, Iago C., Brito, Felipe T., Farias, Victor A. E., Machado, Javam C.

arXiv.org Artificial Intelligence

Computer manufacturers offer platforms for users to describe device faults using textual reports such as "My screen is flickering". Identifying the faulty component from the report is essential for automating tests and improving user experience. However, such reports are often ambiguous and lack detail, making this task challenging. Large Language Models (LLMs) have shown promise in addressing such issues. This study evaluates 27 open-source models (1B-72B parameters) and 2 proprietary LLMs using four prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought (CoT), and CoT+Few-Shot (CoT+FS). We conducted 98,948 inferences, processing over 51 million input tokens and generating 13 million output tokens. We achieve an F1-score of up to 0.76. Results show that three models offer the best balance between size and performance: mistral-small-24b-instruct and two smaller models, llama-3.2-1b-instruct


DataMan: Data Manager for Pre-training Large Language Models

Peng, Ru, Yang, Kexin, Zeng, Yawen, Lin, Junyang, Liu, Dayiheng, Zhao, Junbo

arXiv.org Artificial Intelligence

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by "reverse thinking": prompting LLMs to self-identify which criteria benefit their performance. Because their pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5, surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.


Generative Active Adaptation for Drifting and Imbalanced Network Intrusion Detection

Gupta, Ragini, Liu, Shinan, Zhang, Ruixiao, Hu, Xinyue, Kommaraju, Pranav, Wang, Xiaoyang, Benkraouda, Hadjer, Feamster, Nick, Nahrstedt, Klara

arXiv.org Artificial Intelligence

Machine learning has shown promise in network intrusion detection systems, yet its performance often degrades due to concept drift and imbalanced data. These challenges are compounded by the labor-intensive process of labeling network traffic, especially when dealing with evolving and rare attack types, which makes selecting the right data for adaptation difficult. To address these issues, we propose a generative active adaptation framework that minimizes labeling effort while enhancing model robustness. Our approach employs density-aware active sampling to identify the most informative samples for annotation and leverages deep generative models to synthesize diverse samples, thereby augmenting the training set and mitigating the effects of concept drift. We evaluate our end-to-end framework on both simulated IDS data and a real-world ISP dataset, demonstrating significant improvements in intrusion detection performance. Our method boosts the overall F1-score from 0.60 (without adaptation) to 0.86. Rare attacks such as Infiltration, Web Attack, and FTP-BruteForce, which originally achieve F1 scores of 0.001, 0.04, and 0.00, improve to 0.30, 0.50, and 0.71, respectively, with generative active adaptation in the CIC-IDS 2018 dataset. Our framework effectively enhances rare attack detection while reducing labeling costs, making it a scalable and adaptive solution for real-world intrusion detection.


VLMs as GeoGuessr Masters: Exceptional Performance, Hidden Biases, and Privacy Risks

Huang, Jingyuan, Huang, Jen-tse, Liu, Ziyi, Liu, Xiaoyuan, Wang, Wenxuan, Zhao, Jieyu

arXiv.org Artificial Intelligence

Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, significant challenges remain, including biases and privacy concerns. To systematically address these issues in the context of geographic information recognition, we introduce a benchmark dataset consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to 53.8% accuracy in city prediction, they exhibit significant regional biases. Specifically, performance is substantially higher for economically developed and densely populated regions compared to less developed (-12.5%) and sparsely populated (-17.0%) areas. Moreover, the models exhibit regional biases, frequently overpredicting certain locations; for instance, they consistently predict Sydney for images taken in Australia. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.


Data denoising with self consistency, variance maximization, and the Kantorovich dominance

Hiew, Joshua Zoen-Git, Lim, Tongseok, Pass, Brendan, de Souza, Marcelo Cruz

arXiv.org Artificial Intelligence

We introduce a new framework for data denoising, partially inspired by martingale optimal transport. For a given noisy distribution (the data), our approach involves finding the closest distribution to it among all distributions which 1) have a particular prescribed structure (expressed by requiring they lie in a particular domain), and 2) are self-consistent with the data. We show that this amounts to maximizing the variance among measures in the domain which are dominated in convex order by the data. For particular choices of the domain, this problem and a relaxed version of it, in which the self-consistency condition is removed, are intimately related to various classical approaches to denoising. We prove that our general problem has certain desirable features: solutions exist under mild assumptions, have certain robustness properties, and, for very simple domains, coincide with solutions to the relaxed problem. We also introduce a novel relationship between distributions, termed Kantorovich dominance, which retains certain aspects of the convex order while being a weaker, more robust, and easier-to-verify condition. Building on this, we propose and analyze a new denoising problem by substituting the convex order in the previously described framework with Kantorovich dominance. We demonstrate that this revised problem shares some characteristics with the full convex order problem but offers enhanced stability, greater computational efficiency, and, in specific domains, more meaningful solutions. Finally, we present simple numerical examples illustrating solutions for both the full convex order problem and the Kantorovich dominance problem.


A dimensionality reduction technique based on the Gromov-Wasserstein distance

Eufrazio, Rafael P., Montesuma, Eduardo Fernandes, Cavalcante, Charles C.

arXiv.org Machine Learning

Analyzing relationships between objects is a pivotal problem within data science. In this context, dimensionality reduction (DR) techniques are employed to generate smaller and more manageable data representations. This paper proposes a new method for dimensionality reduction, based on optimal transportation theory and the Gromov-Wasserstein distance. We offer a new probabilistic view of the classical Multidimensional Scaling (MDS) algorithm and of Isomap (Isometric Mapping, or Isometric Feature Mapping), the nonlinear dimensionality reduction algorithm that extends classical MDS. In this view, we use the Gromov-Wasserstein distance between the probability measure of the high-dimensional data and that of its low-dimensional representation. Through gradient descent, our method embeds high-dimensional data into a lower-dimensional space, providing a robust and efficient solution for analyzing complex high-dimensional datasets.
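For orientation, the quantity that Gromov-Wasserstein methods minimize can be evaluated for a fixed coupling in a few lines. The sketch below uses the standard discrete squared-loss GW objective; the paper's exact formulation and its gradient-descent procedure are not reproduced here. The helper name `gw_objective` and the toy data are assumptions.

```python
import numpy as np

def gw_objective(C1, C2, T):
    """Squared-loss GW objective for a fixed coupling T:
    sum over i, j, k, l of (C1[i, k] - C2[j, l])**2 * T[i, j] * T[k, l].
    The square is expanded so that no 4-D array is materialized.
    """
    p = T.sum(axis=1)                         # marginal on the first space
    q = T.sum(axis=0)                         # marginal on the second space
    term1 = p @ (C1 ** 2) @ p                 # sum of C1[i,k]^2 * p[i] * p[k]
    term2 = q @ (C2 ** 2) @ q                 # sum of C2[j,l]^2 * q[j] * q[l]
    cross = np.sum((C1 @ T @ C2.T) * T)       # sum of C1[i,k]*C2[j,l]*T[i,j]*T[k,l]
    return term1 + term2 - 2.0 * cross

# Toy data (assumption): one point cloud compared with itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
C = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
n = C.shape[0]

T_identity = np.eye(n) / n                         # matches each point to itself
T_shuffled = np.eye(n)[rng.permutation(n)] / n     # matches points at random

loss_identity = gw_objective(C, C, T_identity)
loss_shuffled = gw_objective(C, C, T_shuffled)
```

The identity coupling preserves every pairwise distance, so its objective is zero up to rounding, while a random matching distorts distances and scores higher. A GW-based embedding method would additionally optimize over the coupling and over the low-dimensional coordinates that define `C2`.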


Text-to-SQL based on Large Language Models and Database Keyword Search

Nascimento, Eduardo R., Avila, Caio Viktor S., Izquierdo, Yenier T., García, Grettel M., Andrade, Lucas Feijó L., Facina, Michelle S. P., Lemos, Melissa, Casanova, Marco A.

arXiv.org Artificial Intelligence

Text-to-SQL prompt strategies based on Large Language Models (LLMs) achieve remarkable performance on well-known benchmarks. However, when applied to real-world databases, their performance is significantly lower than on these benchmarks, especially for Natural Language (NL) questions that require complex filters and joins. This paper proposes a strategy to compile NL questions into SQL queries that incorporates a dynamic few-shot examples strategy and leverages the services provided by a database keyword search (KwS) platform. The paper details how the precision and recall of the schema-linking process are improved with the help of the examples provided and the keyword-matching service that the KwS platform offers. Then, it shows how the KwS platform can be used to synthesize a view that captures the joins required to process an input NL question and thereby simplify the SQL query compilation step. The paper includes experiments with a real-world relational database to assess the performance of the proposed strategy. The experiments suggest that the strategy achieves an accuracy on the real-world relational database that surpasses state-of-the-art approaches. The paper concludes by discussing the results obtained.


Comparison of Neural Models for X-ray Image Classification in COVID-19 Detection

Togni, Jimi, Attux, Romis

arXiv.org Artificial Intelligence

This study presents a comparative analysis of methods for detecting COVID-19 infection in radiographic images. The images, sourced from publicly available datasets, were categorized into three classes: 'normal,' 'pneumonia,' and 'COVID.' For the experiments, transfer learning was employed using eight pre-trained networks: SqueezeNet, DenseNet, ResNet, AlexNet, VGG, GoogleNet, ShuffleNet, and MobileNet. DenseNet achieved the highest accuracy of 97.64% using the Adam optimizer in the multiclass approach. In the binary classification approach, the highest precision was 99.98%, obtained by the VGG, ResNet, and MobileNet networks. A comparative evaluation was also conducted using heat maps.